
Add model for Sesame TTS#36

Merged
Blaizzy merged 15 commits into Blaizzy:main from lucasnewman:sesame
Mar 17, 2025

Conversation

@lucasnewman
Collaborator

@lucasnewman lucasnewman commented Mar 15, 2025

Support for the Sesame TTS model, based on the official implementation and pre-trained model here.

Example usage:
python -m mlx_audio.tts.generate --model lucasnewman/csm-1b-mlx --play --text "Hello from Sesame."

TODO:

  • Basic architecture support
  • PyTorch weight loading
  • Audio generation loop running
  • Debug output to match PyTorch
  • Watermarking
  • Load weights as safetensors
  • Integrate into CLI generation

I'll save quantization support as a follow-up since it's not really my area of expertise.

@Blaizzy
Owner

Blaizzy commented Mar 15, 2025

Great job @lucasnewman, I love the speed! 🚀

The mimi codec will be really useful for some models on our roadmap.

FYI, there is an existing repo you can take inspiration from:

https://github.com/senstella/csm-mlx

I will help out when I return from vacation this coming week.

@lucasnewman
Collaborator Author

Thanks for the reference! Basic audio gen is working now -- here's a sample.

@lucasnewman lucasnewman marked this pull request as ready for review March 16, 2025 03:54
@lucasnewman lucasnewman changed the title from [WIP] Add model for Sesame TTS to Add model for Sesame TTS Mar 16, 2025
@lucasnewman
Collaborator Author

@Blaizzy I don't know if you want to use the (unquantized) model I uploaded to HF or another repo -- it's up to you! This is what the output looks like:

Model: lucasnewman/csm-1b-mlx
Text: Hello from Sesame.
Voice: af_heart
Speed: 1.0x
Language: a
==========
Audio generated successfully, saving to audio!
==========
Duration:              00:00:01.280
Samples/sec:           18930.8
Prompt:                33 tokens, 20.3 tokens-per-sec
Audio:                 30720 samples, 18930.8 samples-per-sec
Real-time factor:      1.27x
Processing time:       1.62s

The voice, speed, & language aren't applicable here but I was trying to be as surgical as possible with the model loading / generate changes. Feel free to change it up to whatever you'd like.
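As a sanity check, the stats in that log are internally consistent. A quick back-of-the-envelope calculation (this helper is not part of mlx_audio; it assumes a 24 kHz output sample rate, which matches 30720 samples over a 1.280 s duration):

```python
# Back-of-the-envelope check of the generation stats above.
num_samples = 30720
processing_time = 1.62  # seconds, from the log
sample_rate = 24_000    # assumed; consistent with 30720 samples / 1.280 s

duration = num_samples / sample_rate             # 1.280 s of audio
samples_per_sec = num_samples / processing_time  # ~18963, close to the reported 18930.8
rtf = processing_time / duration                 # ~1.27x; here >1x means slower than real time

print(f"Duration: {duration:.3f}s, RTF: {rtf:.2f}x")
```

Note that with this convention a real-time factor above 1.0x means generation takes longer than the audio it produces.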

@Blaizzy
Owner

Blaizzy commented Mar 16, 2025

This is phenomenal @lucasnewman, you crushed it! 🔥

I'll review and merge tomorrow, and handle the quantization as well.

> @Blaizzy I don't know if you want to use the (unquantized) model I uploaded to HF or another repo -- it's up to you! This is what the output looks like:

Could you upload your copy to mlx-community with the name:
mlx-community/csm-1b-bf16

and update the path in utils.py?

> The voice, speed, & language aren't applicable here but I was trying to be as surgical as possible with the model loading / generate changes. Feel free to change it up to whatever you'd like.

I'm thinking about a general API design. For instance, in my view ref_audio == voice; depending on the model, it will alternate between text and a path to an audio file. But I'll save refactoring for v0.1.0, when we get STS up and running.

@lucasnewman
Collaborator Author

> Could you upload your copy to mlx-community with the name: mlx-community/csm-1b-bf16

The base model is fp32, not bf16, so I'll put it at mlx-community/csm-1b 👍

@lucasnewman
Collaborator Author

> I'm thinking about a general API design. For instance, in my view ref_audio == voice; depending on the model, it will alternate between text and a path to an audio file. But I'll save refactoring for v0.1.0, when we get STS up and running.

Yep, that makes sense. We'll need some kind of voice_caption / voice_text parameter for Sesame and models like F5-TTS, since they require the caption alongside the reference audio.
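One way to picture the unified voice parameter being discussed: a "voice" is either a named preset or a reference-audio path plus its caption. This is a hypothetical sketch of the idea, not the actual mlx_audio API; the `Voice` class and `resolve_voice` helper are invented here for illustration, while the `ref_audio` and `voice_text` names come from the discussion above:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch -- not mlx_audio's real API.
@dataclass
class Voice:
    name: Optional[str] = None        # preset voice id, e.g. "af_heart"
    ref_audio: Optional[str] = None   # path to reference audio for cloning
    voice_text: Optional[str] = None  # caption/transcript of ref_audio

    def is_cloned(self) -> bool:
        return self.ref_audio is not None

def resolve_voice(voice: Voice) -> str:
    """Validate a voice spec; cloning models need the caption too."""
    if voice.is_cloned():
        if voice.voice_text is None:
            raise ValueError("ref_audio requires voice_text (the caption)")
        return f"clone:{voice.ref_audio}"
    return f"preset:{voice.name}"
```

Under this design, Kokoro-style models consume the preset name while Sesame/F5-TTS-style models consume the reference audio plus caption, and the caller passes the same object either way.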

@Blaizzy
Owner

Blaizzy commented Mar 17, 2025

>> Could you upload your copy to mlx-community with the name: mlx-community/csm-1b-bf16

> The base model is fp32, not bf16, so I'll put it at mlx-community/csm-1b 👍

Sure, that makes sense.

I'm used to converting it to bf16.

@Blaizzy
Owner

Blaizzy commented Mar 17, 2025

>> I'm thinking about a general API design. For instance, in my view ref_audio == voice; depending on the model, it will alternate between text and a path to an audio file. But I'll save refactoring for v0.1.0, when we get STS up and running.

> Yep, that makes sense. We'll need some kind of voice_caption / voice_text parameter for Sesame and models like F5-TTS, since they require the caption alongside the reference audio.

Got it, let me check a few things and come back with some suggestions.

@Blaizzy Blaizzy merged commit 267d61e into Blaizzy:main Mar 17, 2025
1 check passed
@Blaizzy
Owner

Blaizzy commented Mar 17, 2025

Merged! 🚀

@lucasnewman lucasnewman deleted the sesame branch March 19, 2025 22:11